\(^1\) GenomEast platform, IGBMC

1 - Training setup

1.1 - Datasets analyzed during this training

For this training, we will use two datasets:

  • datasets produced by Achour et al. (PubMed). In this project, the authors analyzed transcriptomic (RNA-seq) and epigenomic (ChIP-seq) data from the striatum of Huntington’s disease mice. We will focus on the RNA-seq data.

The data are publicly available in GEO under the accession number GSE59572. This series contains two subseries.

1.2 - Tools used during this training

1.2.1 - Galaxy

Bioinformatics tools will be run through the French instance of Galaxy, Galaxy France, to analyze the data.

1.2.2 - IGV

The Genome browser IGV will be used to visualize the data in a genomics context.

1.2.3 - Biojupies

Biojupies will be used to run the differential expression analysis.

2 - Prepare your Galaxy environment

Galaxy is a platform that allows users to run bioinformatics tools on a high-performance computing cluster through a simple web interface. We are going to use the French instance of Galaxy, Galaxy France.

2.1 - Log in to Galaxy

Go to the Galaxy France website, https://usegalaxy.fr/, and log in with your personal account.

2.2 - Import a public history that contains data that will be analyzed during this training

Data analyzed during this training are available in a public history: https://usegalaxy.fr/u/stephanie/h/neuro-epigenetics-training-data. Import this history.

2.2.1 - Browse to the history named “Neuro-epigenetics training (data)”

2.2.2 - Import the history

2.2.3 - Create a new working history

2.2.4 - Name the new history “Neuro-epigenetics training”

2.2.5 - Import raw data (fastq files) from the imported history to the newly created history “Neuro-epigenetics training”

The datasets are in the imported history “Imported: Neuro-epigenetics training (data)”.

  • Click on the downward arrow at the top right of your history panel and select “Show History Side-by-Side”

  • Drag and drop the datasets R6_1_387_St.chr19.fastq.gz and WT_320_St.chr19.fastq.gz from the imported history to the working one.

3 - Analysis of RNA-seq data

Analysis of RNA-seq data will be run with the following steps:

  • [Galaxy] Quality controls
  • [Galaxy] Mapping
  • [Galaxy] Generation of visualization tracks
  • [IGV] Visualization of the data
  • [not done] Generation of per gene counts matrix
  • [Biojupies] Differential expression analysis

3.1 - Quality controls

Tool: FastQC

Website: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Citation: Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Use of the tool: FastQC assesses the quality of high-throughput sequencing data. It takes raw sequencing data (fastq files) or mapping results (BAM or SAM files) and generates an HTML report whose summary graphs give a quick overview of data quality.
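
To give an idea of what one of these summary graphs is built from, here is a minimal Python sketch that computes the mean base quality at each read position, similar in spirit to the "per base sequence quality" plot. It is not FastQC's implementation. It assumes a gzipped fastq file with exactly four lines per read and Phred+33 quality encoding; the function name is hypothetical.

```python
import gzip
from collections import defaultdict


def mean_quality_per_position(fastq_gz_path, phred_offset=33):
    """Return the mean Phred quality score at each read position.

    Assumes 4-line fastq records and Phred+33 encoding (both assumptions).
    """
    totals = defaultdict(int)  # position -> sum of quality scores
    counts = defaultdict(int)  # position -> number of reads covering it
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In a 4-line record, the 4th line holds the quality string.
            if line_number % 4 != 3:
                continue
            for position, char in enumerate(line.rstrip("\n")):
                totals[position] += ord(char) - phred_offset
                counts[position] += 1
    return [totals[i] / counts[i] for i in sorted(totals)]
```

Calling `mean_quality_per_position("WT_320_St.chr19.fastq.gz")` on a downloaded copy of the training file would return one value per read position; a drop toward the end of the reads is the kind of pattern the FastQC report makes visible.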

3.1.1 - Search for the term “fastqc” in the top left search field and click on the tool name “FastQC”

3.1.2 - Run FastQC on the WT sample (WT_320_St.chr19.fastq.gz)

3.1.3 - Do the same on the R6/1 sample

3.2 - Mapping

Tool: STAR

Documentation: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

Citation: Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635. Epub 2012 Oct 25. PMID: 23104886; PMCID: PMC3530905.

Use of the tool: STAR maps RNA-seq reads to the reference genome very quickly. It uses known transcript junction information to align reads but can also discover new splice junctions.
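
For readers curious about what such a mapping job looks like outside Galaxy, the following Python sketch assembles a STAR command line without running it. It is only an illustration: the parameters of the Galaxy tool form used in this training are not reproduced here, the reads are assumed to be single-end, and the index directory `star_index_mm9` and the function name are hypothetical.

```python
import shlex


def build_star_command(fastq_gz, gtf, genome_dir, out_prefix, threads=4):
    """Assemble a STAR command line for single-end, gzipped reads (assumed)."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,        # pre-built STAR index (hypothetical path)
        "--readFilesIn", fastq_gz,
        "--readFilesCommand", "zcat",     # decompress the gzipped reads on the fly
        "--sjdbGTFfile", gtf,             # known transcript junctions
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]


command = build_star_command(
    "WT_320_St.chr19.fastq.gz",
    "Mus_musculus.NCBIM37.67_UCSConlychr.gtf",
    "star_index_mm9",  # hypothetical index directory
    "WT_320_St.chr19.",
)
# Print the command as a single shell-quoted string for inspection.
print(shlex.join(command))
```

The GTF file passed with `--sjdbGTFfile` is what provides the known transcript junctions mentioned above.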

3.2.1 - Add the dataset that describes transcript structure (Mus_musculus.NCBIM37.67_UCSConlychr.gtf) to your current history

  • The dataset is in the imported history “Imported: Neuro-epigenetics training (data)”.

  • Click on the downward arrow at the top right of your history panel and select “Show History Side-by-Side”

  • Drag and drop the dataset Mus_musculus.NCBIM37.67_UCSConlychr.gtf from the imported history to the working one.

3.2.2 - Run STAR to map the reads to the genome

3.2.3 - Import the two datasets WT_320_St.chr19.bam and R6_1_387_St.chr19.bam to your working history.

As mapping is a long processing step, mapping data are provided in the imported history “Imported: Neuro-epigenetics training (data)”.

  • Click on “Show history options” > “Show History Side-by-Side”
  • Drag and drop the two datasets from the imported history to the working one.

3.3 - Generation of visualization tracks

Tool: deepTools bamCoverage

Documentation: https://deeptools.readthedocs.io/en/develop/content/tools/bamCoverage.html

Citation: Ramírez, Fidel, Devon P. Ryan, Björn Grüning, Vivek Bhardwaj, Fabian Kilpert, Andreas S. Richter, Steffen Heyne, Friederike Dündar, and Thomas Manke. deepTools2: A next Generation Web Server for Deep-Sequencing Data Analysis. Nucleic Acids Research (2016). doi:10.1093/nar/gkw257.

Use of the tool: deepTools is a suite of tools for handling next-generation sequencing data, especially ChIP-seq and RNA-seq data. Some of its tools create plots that give a global view of the data.

3.3.1 - Use the tool bamCoverage to generate comparable signal tracks from mapping data
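
Tracks become comparable once the coverage is scaled by sequencing depth. The Python sketch below illustrates one such scaling, counts per million mapped reads (CPM), which is among the options bamCoverage offers through `--normalizeUsing`. Whether CPM is the setting chosen in this training is an assumption, and the read counts are toy numbers, not values from the training data.

```python
def cpm_normalize(bin_counts, total_mapped_reads):
    """Scale per-bin read counts to counts per million mapped reads (CPM)."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    scale = 1_000_000 / total_mapped_reads
    return [count * scale for count in bin_counts]


# Toy example: the second library was sequenced twice as deeply, so its raw
# counts are doubled, yet both tracks are identical after CPM scaling.
wt_track = cpm_normalize([10, 20, 0], total_mapped_reads=2_000_000)
r6_track = cpm_normalize([20, 40, 0], total_mapped_reads=4_000_000)
print(wt_track, r6_track)
```

Without this kind of scaling, a difference in track height between WT and R6/1 could simply reflect a difference in the number of sequenced reads.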

3.4 - Visualization of the data

3.4.1 - Download mapping files

Do it for the two files WT_320_St.chr19.bam and R6_1_387_St.chr19.bam

3.4.2 - Download results files (results from bamCoverage)

Do it for the two result datasets.

3.4.3 - Launch IGV, select the assembly: mm9

3.4.4 - Load the bam files and the bigwig files

Note: the index file (.bam.bai) of each bam file must be in the same directory as the bam file, otherwise the bam file won’t be loaded!
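
Before opening IGV, the presence of the index files can be checked with a small Python sketch like the one below. The function name is hypothetical, and accepting both `sample.bam.bai` and `sample.bai` as index names is an assumption about the naming schemes you may encounter.

```python
from pathlib import Path


def bams_missing_index(directory):
    """List bam files in `directory` that have no index file next to them."""
    missing = []
    for bam in sorted(Path(directory).glob("*.bam")):
        # Accept two naming schemes: sample.bam.bai and sample.bai
        candidates = [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]
        if not any(index.exists() for index in candidates):
            missing.append(bam.name)
    return missing
```

Running `bams_missing_index(".")` in the download directory should return an empty list if both bam files have their index.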

In IGV menu:

Select bam files and bigwig files.

You should get:

3.4.5 - Go to chromosome 19

3.4.6 - Select the two bigwig tracks

3.4.6.1 - Set them to the same scale: right-click + select Group Autoscale

3.4.6.2 - Set the windowing function to Maximum: right-click + select Maximum

3.4.7 - Go to Syt12 gene

3.5 - Generation of per gene counts matrix

3.5.1 - The matrix of read counts per gene is available on the GEO website

It has been downloaded from GEO and is available in the file data/GSE59571_S13113_readCounts.xlsx. We are going to run a differential expression analysis on these data.

3.6 - Differential expression analysis

3.6.1 - Use the matrix in the tool Biojupies to run a differential expression analysis

Tool: Biojupies

Website: https://maayanlab.cloud/biojupies/

Documentation: https://maayanlab.cloud/biojupies/help

Citation: Torre D, Lachmann A, Ma’ayan A. BioJupies: Automated Generation of Interactive Notebooks for RNA-Seq Data Analysis in the Cloud. Cell Syst. 2018 Nov 28;7(5):556-561.e3. doi: 10.1016/j.cels.2018.10.007. Epub 2018 Nov 14. PMID: 30447998; PMCID: PMC6265050.

Use of the tool: BioJupies is a web application for RNA-seq data analysis. Through an intuitive interface, users can rapidly generate tailored reports to analyze and visualize their own raw sequencing files or gene expression tables, or fetch data from >9,000 published studies containing >300,000 preprocessed RNA-seq samples.

3.6.2 - Start an analysis with Biojupies

The created notebook is available here: https://maayanlab.cloud/biojupies/notebook/3bDxb3Opy

3.6.3 - Download the list of deregulated genes. Is Syt12 significantly deregulated?
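
Once the list is downloaded, the question can also be answered programmatically. The Python sketch below looks up a gene in a tab-separated table and compares its adjusted p-value to a threshold. The column names `gene_symbol` and `adj.P.Val`, the tab delimiter and the 0.05 threshold are assumptions to adapt to the file you actually obtain, and the table in the example is invented for illustration only.

```python
import csv
import io


def gene_is_significant(table_text, gene, alpha=0.05,
                        gene_col="gene_symbol", padj_col="adj.P.Val",
                        delimiter="\t"):
    """Return True or False for significance, or None if the gene is absent.

    Column names and delimiter are assumptions about the downloaded table.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    for row in reader:
        if row[gene_col] == gene:
            return float(row[padj_col]) < alpha
    return None


# Invented toy table, not results from this training.
toy_table = (
    "gene_symbol\tlogFC\tadj.P.Val\n"
    "GeneA\t1.2\t0.01\n"
    "GeneB\t-0.1\t0.8\n"
)
print(gene_is_significant(toy_table, "GeneA"))
```

With the real file, read its content with `open(path).read()` and pass `"Syt12"` as the gene name.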